Similarity-based Deduplication for Databases

Authors

Lianghong Xu

Andrew Pavlo

Sudipta Sengupta

Gregory R. Ganger

Abstract

dDedup is a similarity-based deduplication scheme for on-line database management systems (DBMSs). Beyond block-level compression of individual database pages or operation log (oplog) messages, as used in today’s DBMSs, dDedup uses byte-level delta encoding of individual records within the database to achieve greater savings. dDedup’s single-pass encoding method can be integrated into the storage and logging components of a DBMS to provide two benefits: (1) reduced size of data stored on disk beyond what traditional compression schemes provide, and (2) reduced amount of data transmitted over the network for replication services. To evaluate our work, we implemented dDedup in a distributed NoSQL DBMS and analyzed its properties using four real datasets. Our results show that dDedup achieves up to 37× reduction in the storage size and replication traffic of the database on its own and up to 61× reduction when paired with the DBMS’s block-level compression. dDedup provides both benefits with negligible effect on DBMS throughput or client latency (average and tail).

Acknowledgements: We thank the members and companies of the PDL Consortium (including Avago, Citadel, EMC, Facebook, Google, Hewlett-Packard Labs, Hitachi, Intel, Microsoft Research, MongoDB, NetApp, Oracle, Samsung, Seagate, Two Sigma, Western Digital) for their interest, insights, feedback, and support. This research was sponsored in part by Intel as part of the Intel Science and Technology Center for Cloud Computing (ISTC-CC) and MongoDB Incorporated. Experiments were enabled by generous hardware donations from Intel, NetApp, and APC.
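The byte-level delta encoding of records mentioned in the abstract can be illustrated with a minimal sketch. This is not dDedup's actual encoder: it swaps in Python's standard `difflib.SequenceMatcher` to find shared byte ranges, and the function names (`delta_encode`, `delta_decode`, `literal_bytes`) and the sample records are hypothetical. It only shows the general idea of storing a new record as copy/insert instructions against a similar record that is already stored.

```python
from difflib import SequenceMatcher


def delta_encode(base: bytes, target: bytes) -> list:
    """Encode `target` as copy/insert operations against a similar `base` record.

    Illustrative stand-in only; dDedup's own encoder is not shown in the abstract.
    """
    ops = []
    matcher = SequenceMatcher(None, base, target, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            # Bytes shared with the base: store only the base offsets.
            ops.append(("copy", i1, i2))
        elif tag in ("replace", "insert"):
            # Bytes unique to the target: store them literally.
            ops.append(("insert", target[j1:j2]))
        # "delete" ranges exist only in the base, so nothing is emitted.
    return ops


def delta_decode(base: bytes, ops: list) -> bytes:
    """Rebuild the target record from the base record and the delta."""
    out = bytearray()
    for op in ops:
        if op[0] == "copy":
            out += base[op[1]:op[2]]
        else:
            out += op[1]
    return bytes(out)


def literal_bytes(ops: list) -> int:
    """Count the bytes that must be stored literally in the delta."""
    return sum(len(op[1]) for op in ops if op[0] == "insert")


if __name__ == "__main__":
    # Hypothetical records: two revisions of the same document.
    base = b'{"id": 1, "body": "similarity-based deduplication, revision one"}'
    target = b'{"id": 2, "body": "similarity-based deduplication, revision two"}'
    ops = delta_encode(base, target)
    assert delta_decode(base, ops) == target
    print(len(target), "bytes in record,", literal_bytes(ops), "stored literally")
```

In this sketch the savings come from the copy operations: the more bytes a record shares with its chosen base, the fewer literal bytes need to be written to disk or sent to replicas. How a similar base record is found in the first place is outside the scope of the example.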

Similar resources

An Approach Based on Artificial Neural Network for Data Deduplication

Data quality problems arise with the constantly increasing quantity of data. The quality of data stored in real-world databases is assured by the vital data cleaning process. Several research fields like knowledge discovery in databases, data warehousing, system integration and e-services often encounter data cleaning problems. The fundamental element of data cleaning is usually termed as dedupl...

A Novel Deduplication Technique Using an Evolutionary Approach

The process that identifies the records that refer to the same entity in data storage is known as record deduplication. UDD can effectively identify duplicates of different web databases. Initially from the non-duplicate record set, the two different classifiers, a Weighted WCSS is used for deduplication. The approach joins different pieces of attribute with similarity function extracted from...

A Heuristic Approach to Record Deduplication

Databases and database-related technologies are having a major impact on the growing use of computers. Many global data repositories collect data from various data sources, so duplicates in these repositories are more likely. Duplicates in a database are the result of misleading words and different writing styles. The presence of duplicate records decreases the system perform...

SiLo: A Similarity-Locality based Near-Exact Deduplication Scheme with Low RAM Overhead and High Throughput

Data Deduplication is becoming increasingly popular in storage systems as a space-efficient approach to data backup and archiving. Most existing state-of-the-art deduplication methods are either locality based or similarity based, which, according to our analysis, do not work adequately in many situations. While the former produces poor deduplication throughput when there is little or no locali...

Online Deduplication for Distributed Databases

The rate of data growth outpaces the decline of hardware costs, and there has been an ever-increasing demand for reducing the storage and network overhead for online database management systems (DBMSs). The most widely used approach for data reduction in DBMSs is block-level compression. Although this method is simple and effective, it fails to address redundancy across blocks and therefore leave...

Secure and Efficient Client and Server Side Data Deduplication to Reduce Storage in Remote Cloud Computing Systems

Duplication of data in storage systems is becoming an increasingly common problem. The system introduces I/O Deduplication, a storage optimization that utilizes content similarity for improving I/O performance by eliminating I/O operations and reducing the mechanical delays during I/O operations, and shares data with existing users if deduplication is found on the client or server side. I/O Deduplicat...

Publication date: 2016
